Abstract

Sparse learning has recently received increasing attention in many areas, including
machine learning, statistics, and applied mathematics. The mixed-norm regularization
based on the $\ell_1/\ell_q$ norm with $q > 1$ is attractive in many applications of
regression and classification in that it facilitates group sparsity in the model. The
resulting optimization problem is, however, challenging to solve due to the structure
of the $\ell_1/\ell_q$-regularization. Existing work deals with the special cases
$q = 2, \infty$, and it cannot be easily extended to the general case. In this paper, we
propose an efficient algorithm based on the accelerated gradient method for solving
the $\ell_1/\ell_q$-regularized problem, which is applicable for all values of $q$ larger
than 1, thus significantly extending existing work. One key building block of the
proposed algorithm is the $\ell_1/\ell_q$-regularized Euclidean projection (EP$_{1q}$). Our
theoretical analysis reveals the key properties of EP$_{1q}$ and illustrates why EP$_{1q}$ for
the general $q$ is significantly more challenging to solve than the special cases. Based
on our theoretical analysis, we develop an efficient algorithm for EP$_{1q}$ by solving
two zero finding problems. Experimental results demonstrate the efficiency of the
proposed algorithm.

1 Introduction 

Regularization has played a central role in many machine learning algorithms. The
$\ell_1$-regularization has recently received increasing attention, due to its
sparsity-inducing property, convenient convexity, strong theoretical guarantees, and great
empirical success in various applications. A well-known application of the
$\ell_1$-regularization is the Lasso [32]. Recent studies in areas such as machine learning,
statistics, and applied mathematics have witnessed growing interest in extending the
$\ell_1$-regularization to the $\ell_1/\ell_q$-regularization [1, 7, 17, 23, 29, 37, 38].
This leads to the following $\ell_1/\ell_q$-regularized minimization problem:

$$\min_{W \in \mathbb{R}^p} f(W) \equiv l(W) + \lambda \varpi(W), \qquad (1)$$

where $W \in \mathbb{R}^p$ denotes the model parameters, $l(\cdot)$ is a convex loss dependent
on the training samples and their corresponding responses,
$W = [\mathbf{w}_1^T, \mathbf{w}_2^T, \ldots, \mathbf{w}_s^T]^T$ is divided into $s$
non-overlapping groups, $\mathbf{w}_i \in \mathbb{R}^{p_i}$, $i = 1, 2, \ldots, s$,
$\lambda > 0$ is the regularization parameter, and

$$\varpi(W) = \sum_{i=1}^{s} \|\mathbf{w}_i\|_q \qquad (2)$$

is the $\ell_1/\ell_q$ norm with $\|\cdot\|_q$ denoting the vector $\ell_q$ norm ($q \ge 1$).
The $\ell_1/\ell_q$-regularization belongs to the composite absolute penalties (CAP) [38]
family. When $q = 1$, the problem (1) reduces to the $\ell_1$-regularized problem. When
$q > 1$, the $\ell_1/\ell_q$-regularization facilitates group sparsity in the resulting
model, which is desirable in many applications of regression and classification.

The practical challenge in the use of the $\ell_1/\ell_q$-regularization lies in the
development of efficient algorithms for solving (1), due to the non-smoothness of the
$\ell_1/\ell_q$-regularization. According to the black-box complexity theory [25, 26], the
optimal first-order black-box method [25, 26] for solving the class of nonsmooth convex
problems converges as $O(1/\sqrt{k})$ ($k$ denotes the number of iterations), which is slow.
Existing algorithms focus on solving the problem (1) or its equivalent constrained version
for $q = 2, \infty$, and they cannot be easily extended to the general case. In order to
systematically study the practical performance of the $\ell_1/\ell_q$-regularization family,
it is of great importance to develop efficient algorithms for solving (1) for any $q$ larger
than 1.

1.1 First-Order Methods Applicable for (1)

When treating $f(\cdot)$ as a general non-smooth convex function, we can apply the
subgradient descent (SD) [15, 25, 26]:

$$X_{i+1} = X_i - \gamma_i G_i, \qquad (3)$$

where $G_i \in \partial f(X_i)$ is a subgradient of $f(\cdot)$ at $X_i$, and $\gamma_i$ is a
step size. There are several different types of step size rules, and more details can be
found in [15, 25]. Subgradient descent is proven to converge, and it can yield a convergence
rate of $O(1/\sqrt{k})$ for $k$ iterations. However, SD has the following two disadvantages:
1) SD converges slowly; and 2) the iterates of SD are very rarely at the points of
non-differentiability [7], thus it might not achieve the desirable sparse solution (which
is usually at the point of non-differentiability) within a limited number of iterations.

Coordinate Descent (CD) [33] and its recent extension, Coordinate Gradient Descent (CGD),
can be applied for optimizing the non-differentiable composite function [34].
Coordinate descent has been applied for the $\ell_1$-norm regularized least squares [9],
$\ell_1/\ell_\infty$-norm regularized least squares [16], and the sparse group Lasso [10].
Coordinate gradient descent has been applied for the group Lasso logistic regression [21].
Convergence results for CD and CGD have been established when the non-differentiable
part is separable [33, 34]. However, there is no global convergence rate for CD and
CGD (note that CGD is reported to have a local linear convergence rate under certain
conditions [34, Theorem 4]). In addition, it is not clear whether CD and CGD are
applicable for solving the problem (1) with an arbitrary $q > 1$.

Fixed Point Continuation [12, 31] was recently proposed for solving the $\ell_1$-norm
regularized optimization (i.e., $\varpi(W) = \|W\|_1$). It is based on the following fixed
point iteration:

$$X_{i+1} = P^{\ell_1}_{\lambda\tau}(X_i - \tau l'(X_i)), \qquad (4)$$

where $P^{\ell_1}_{\lambda\tau}(W) = \mathrm{sgn}(W) \odot \max(|W| - \lambda\tau, 0)$ is an
operator and $\tau > 0$ is the step size. The fixed point iteration (4) can be applied to
solve (1) for any convex penalty $\varpi(W)$, with the operator
$P^{\varpi}_{\lambda\tau}(\cdot)$ being defined as:

$$P^{\varpi}_{\lambda\tau}(W) = \arg\min_X \frac{1}{2}\|X - W\|_2^2 + \lambda\tau\varpi(X). \qquad (5)$$

The operator $P^{\varpi}_{\lambda\tau}(\cdot)$ is called the proximal operator [13, 22, 36],
and is guaranteed to be non-expansive. With a properly chosen $\tau$, the fixed point
iteration (4) can converge to the fixed point $X^*$ satisfying

$$X^* = P^{\varpi}_{\lambda\tau}(X^* - \tau l'(X^*)). \qquad (6)$$

It follows from (5) and (6) that

$$0 \in X^* - (X^* - \tau l'(X^*)) + \lambda\tau\,\partial\varpi(X^*), \qquad (7)$$

which together with $\tau > 0$ indicates that $X^*$ is the optimal solution to (1). In
[3, 27], the gradient descent method is extended to optimize the composite function in the
form of (1), and the iteration step is similar to (4). The extended gradient descent method
is proven to yield the convergence rate of $O(1/k)$ for $k$ iterations. However, as pointed
out in [3, 27], the scheme in (4) can be further accelerated for solving (1).
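
As an illustration of the fixed point iteration (4), the following minimal sketch applies it with the $\ell_1$ soft-thresholding operator to a toy smooth loss. The function names, the toy loss $l(x) = \frac{1}{2}\|x - b\|_2^2$, and the parameter values are assumptions made for this example only.

```python
def soft_threshold(w, t):
    """Componentwise operator sgn(w) * max(|w| - t, 0)."""
    return [(1.0 if wi > 0 else -1.0) * max(abs(wi) - t, 0.0) for wi in w]


def fixed_point_l1(grad, x0, lam, tau, num_iter=200):
    """Iterate X <- P(X - tau * grad(X)) for the l1 penalty, as in (4)."""
    x = list(x0)
    for _ in range(num_iter):
        g = grad(x)
        x = soft_threshold([xi - tau * gi for xi, gi in zip(x, g)], lam * tau)
    return x


# Toy smooth loss l(x) = 0.5 * ||x - b||^2, whose gradient is x - b (assumed example).
b = [3.0, -0.5, 1.2]
x_hat = fixed_point_l1(lambda x: [xi - bi for xi, bi in zip(x, b)],
                       x0=[0.0, 0.0, 0.0], lam=1.0, tau=0.5)
```

For this toy loss the minimizer of (1) is the soft-thresholded vector $b$, so the iterates approach a point whose second component is exactly zero, which is the sparsity-inducing behavior discussed above.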

Finally, there are various online learning algorithms that have been developed for
dealing with large-scale data, e.g., the truncated gradient method [15], the
forward-looking subgradient [7], and the regularized dual averaging [35] (which is based on
the dual averaging method proposed in [28]). When applying the aforementioned online
learning methods for solving (1), a key building block is the operator
$P^{\varpi}_{\lambda\tau}(\cdot)$.

1.2 Main Contributions 

In this paper, we develop an efficient algorithm for solving the $\ell_1/\ell_q$-regularized
problem (1), for any $q \ge 1$. More specifically, we develop the GLEP$_{1q}$ algorithm$^1$,
which makes use of the accelerated gradient method [3, 27] for minimizing the composite
objective functions. GLEP$_{1q}$ has the following two favorable properties: (1) It is
applicable to any smooth convex loss $l(\cdot)$ (e.g., the least squares loss and the
logistic loss) and any $q \ge 1$. Existing algorithms are mainly focused on
$\ell_1/\ell_2$-regularization and/or $\ell_1/\ell_\infty$-regularization. To the best of our
knowledge, this is the first work that provides an efficient algorithm for solving (1) with
any $q \ge 1$; and (2) It achieves a global convergence rate of $O(1/k^2)$ ($k$ denotes the
number of iterations) for the smooth convex loss $l(\cdot)$. In comparison, although the
methods proposed in [1, 6, 16, 29] converge, there is no known convergence rate; and the
method proposed in [21] has a local linear convergence rate under certain conditions
[34, Theorem 4]. In addition, these methods are not applicable for an arbitrary $q \ge 1$.

$^1$ GLEP$_{1q}$ stands for Group Sparsity Learning via the $\ell_1/\ell_q$-regularized
Euclidean Projection.

The main technical contribution of this paper is the development of an efficient
algorithm for computing the $\ell_1/\ell_q$-regularized Euclidean projection (EP$_{1q}$),
which is a key building block in the proposed GLEP$_{1q}$ algorithm. More specifically, we
analyze the key theoretical properties of the solution of EP$_{1q}$, based on which we
develop an efficient algorithm for EP$_{1q}$ by solving two zero finding problems. In
addition, our theoretical analysis reveals why EP$_{1q}$ for the general $q$ is significantly
more challenging than the special cases such as $q = 2$. We have conducted experimental
studies to demonstrate the efficiency of the proposed algorithm.

1.3 Related Work 

We briefly review recent studies on $\ell_1/\ell_q$-regularization, most of which focus on
$\ell_1/\ell_2$-regularization and/or $\ell_1/\ell_\infty$-regularization.

$\ell_1/\ell_2$-Regularization: The group Lasso was proposed in [37] to select the groups
of variables for prediction in the least squares regression. In [21], the idea of group
Lasso was extended for classification by the logistic regression model, and an algorithm
via the coordinate gradient descent [34] was developed. In [29], the authors considered
joint covariate selection for grouped classification by the logistic loss, and developed
a blockwise boosting Lasso algorithm with the boosted Lasso [39]. In [1], the authors
proposed to learn the sparse representations shared across multiple tasks, and
designed an alternating algorithm. The Spectral projected-gradient (Spg) algorithm
was proposed for solving the $\ell_1/\ell_2$-ball constrained smooth optimization problem
[4], equipped with an efficient Euclidean projection that has expected linear runtime. The
$\ell_1/\ell_2$-regularized multi-task learning was proposed in [18], and the equivalent
smooth reformulations were solved by Nesterov's method [26].

$\ell_1/\ell_\infty$-Regularization: A blockwise coordinate descent algorithm [33] was
developed for the multi-task Lasso [16]. It was applied to the neural semantic basis
discovery problem. In [30], the authors considered the multi-task learning via the
$\ell_1/\ell_\infty$-regularization, and proposed to solve the equivalent
$\ell_1/\ell_\infty$-ball constrained problem by the projected gradient descent. In [24],
the authors considered the multivariate regression via the
$\ell_1/\ell_\infty$-regularization, showed that the high-dimensional scaling of
$\ell_1/\ell_\infty$-regularization is qualitatively similar to that of ordinary
$\ell_1$-regularization, and revealed that, when the overlap parameter is large enough
($> 2/3$), $\ell_1/\ell_\infty$-regularization yields improved statistical efficiency over
$\ell_1$-regularization.

$\ell_1/\ell_q$-Regularization: In [6], the authors studied the problem of boosting with
structural sparsity, and developed several boosting algorithms for regularization
penalties including $\ell_1$, $\ell_\infty$, $\ell_1/\ell_2$, and $\ell_1/\ell_\infty$. In
[38], the composite absolute penalties (CAP) family was introduced, and an algorithm called
iCAP was developed. iCAP employed the least squares loss and the $\ell_1/\ell_\infty$
regularization, and was implemented by the boosted Lasso [39]. The multivariate regression
with the $\ell_1/\ell_q$-regularization was studied in [17]. In [23], a unified framework was
provided for establishing consistency and convergence rates for the regularized
$M$-estimators, and the results for $\ell_1/\ell_q$ regularization were established.

1.4 Notation

Throughout this paper, scalars are denoted by italic letters, and vectors by bold face
letters. Let $X, Y, \ldots$ denote the $p$-dimensional parameters,
$\mathbf{x}_i, \mathbf{y}_i, \ldots$ the $p_i$-dimensional parameters of the $i$-th group,
and $x_i$ the $i$-th component of $\mathbf{x}$. We denote $\bar{q} = \frac{q}{q-1}$, and
thus $q$ and $\bar{q}$ satisfy the following relationship:
$\frac{1}{q} + \frac{1}{\bar{q}} = 1$. We use the following componentwise operators:
$\odot$, $|\cdot|$ and $\mathrm{sgn}(\cdot)$. Specifically,
$\mathbf{z} = \mathbf{x} \odot \mathbf{y}$ denotes $z_i = x_i y_i$;
$\mathbf{y} = |\mathbf{x}|$ denotes $y_i = |x_i|$; and
$\mathbf{y} = \mathrm{sgn}(\mathbf{x})$ denotes $y_i = \mathrm{sgn}(x_i)$, where
$\mathrm{sgn}(\cdot)$ is the signum function: $\mathrm{sgn}(t) = 1$ if $t > 0$;
$\mathrm{sgn}(t) = 0$ if $t = 0$; and $\mathrm{sgn}(t) = -1$ if $t < 0$.
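
The notation above maps directly onto a few helper functions. The sketch below is illustrative only; the function names are hypothetical and vectors are represented as plain Python lists.

```python
import math


def conjugate_exponent(q):
    """Return q_bar with 1/q + 1/q_bar = 1 (inf for q = 1, and 1 for q = inf)."""
    if q == 1:
        return math.inf
    if math.isinf(q):
        return 1.0
    return q / (q - 1.0)


def sgn(t):
    """Signum function: 1 if t > 0, 0 if t = 0, -1 if t < 0."""
    return (t > 0) - (t < 0)


def hadamard(x, y):
    """Componentwise product z_i = x_i * y_i."""
    return [xi * yi for xi, yi in zip(x, y)]


x = [-2.0, 0.0, 3.0]
# sgn(x) componentwise-multiplied by |x| recovers x itself.
recovered = hadamard([sgn(xi) for xi in x], [abs(xi) for xi in x])
```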

2 The Proposed GLEP$_{1q}$ Algorithm

In this section, we present the proposed GLEP$_{1q}$ algorithm for solving (1) in the batch
learning setting. The main technical contribution lies in the development of an efficient
algorithm for the $\ell_1/\ell_q$-regularized Euclidean projection. Specifically, we analyze
the key theoretical properties of the projection in Section 2.1, and show that the
projection can be computed by solving two zero finding problems in Section 2.2. Note that
one can develop the online learning algorithm for (1) using the online learning algorithms
discussed in the last section, where the $\ell_1/\ell_q$-regularized Euclidean projection is
also a key building block.

We first construct the following model for approximating the composite function
$f(\cdot)$ at the point $X$ [3, 27]:

$$M_{L,X}(Y) = [l(X) + \langle l'(X), Y - X \rangle] + \lambda\varpi(Y) + \frac{L}{2}\|Y - X\|_2^2, \qquad (8)$$

where $L > 0$. In the model $M_{L,X}(Y)$, we apply the first-order Taylor expansion at
the point $X$ (including all terms in the square bracket) for the smooth loss function
$l(\cdot)$, and directly put the non-smooth penalty $\varpi(\cdot)$ into the model. The
regularization term $\frac{L}{2}\|Y - X\|_2^2$ prevents $Y$ from walking far away from $X$,
thus the model can be a good approximation to $f(Y)$ in the neighborhood of $X$.

The accelerated gradient method is based on two sequences $\{X_i\}$ and $\{S_i\}$, in
which $\{X_i\}$ is the sequence of approximate solutions, and $\{S_i\}$ is the sequence of
search points. The search point $S_i$ is the affine combination of $X_{i-1}$ and $X_i$ as

$$S_i = X_i + \beta_i (X_i - X_{i-1}), \qquad (9)$$

where $\beta_i$ is a properly chosen coefficient. The approximate solution $X_{i+1}$ is
computed as the minimizer of $M_{L_i,S_i}(Y)$:

$$X_{i+1} = \arg\min_Y M_{L_i,S_i}(Y), \qquad (10)$$

where $L_i$ is determined by line search, e.g., the Armijo-Goldstein rule, so that $L_i$
is appropriate for $S_i$.
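
The interplay of the search point (9), the model minimizer (10), and the line search can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name and signature are hypothetical, the coefficient update is the standard rule of the accelerated gradient method, the line search simply doubles $L$, and the proximal step is supplied by the caller.

```python
import math


def accelerated_prox_gradient(loss, grad, prox, x0, lam, L0=1.0, num_iter=100):
    """Sketch of the scheme in (9)-(10); prox(w, t) minimizes 0.5||x-w||^2 + t*penalty(x)."""
    x_prev, x = list(x0), list(x0)
    alpha_prev, alpha = 0.0, 1.0              # two most recent coefficients
    L = L0
    for _ in range(num_iter):
        beta = (alpha_prev - 1.0) / alpha
        s = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]   # search point (9)
        g, loss_s = grad(s), loss(s)
        while True:
            # Minimizer of the model M_{L,S}: a proximal step from S with step 1/L.
            x_new = prox([si - gi / L for si, gi in zip(s, g)], lam / L)
            diff = [a - b for a, b in zip(x_new, s)]
            model = (loss_s + sum(gi * di for gi, di in zip(g, diff))
                     + 0.5 * L * sum(di * di for di in diff))
            # The penalty appears on both sides of f <= M and cancels.
            if loss(x_new) <= model + 1e-12:
                break
            L *= 2.0
        x_prev, x = x, x_new
        alpha_prev, alpha = alpha, (1.0 + math.sqrt(1.0 + 4.0 * alpha * alpha)) / 2.0
    return x


def soft_threshold(w, t):
    """Proximal operator of t * ||x||_1 (used here only as a simple stand-in penalty)."""
    return [math.copysign(max(abs(wi) - t, 0.0), wi) for wi in w]


# Toy problem (assumed): l(x) = 0.5 * ||x - b||^2 with an l1 penalty.
b = [3.0, -0.5, 1.2]
x_hat = accelerated_prox_gradient(
    loss=lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b)),
    grad=lambda x: [xi - bi for xi, bi in zip(x, b)],
    prox=soft_threshold, x0=[0.0, 0.0, 0.0], lam=1.0, num_iter=20)
```

Passing the proximal step as an argument keeps the outer loop independent of the penalty, so the same loop can be reused once a projection for a general $q$ is available.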

The algorithm for solving (1) is presented in Algorithm 1. GLEP$_{1q}$ inherits the
optimal convergence rate of $O(1/k^2)$ from the accelerated gradient method. In
Algorithm 1, a key subroutine is (10), which can be computed as
$X_{i+1} = \pi_{1q}(S_i - l'(S_i)/L_i, \lambda/L_i)$, where $\pi_{1q}(\cdot)$ is the
$\ell_1/\ell_q$-regularized Euclidean projection (EP$_{1q}$) problem:

$$\pi_{1q}(V, \lambda) = \arg\min_{X \in \mathbb{R}^p} \frac{1}{2}\|X - V\|_2^2 + \lambda \sum_{i=1}^{s} \|\mathbf{x}_i\|_q. \qquad (11)$$

Algorithm 1 GLEP$_{1q}$: Group Sparsity Learning via the $\ell_1/\ell_q$-regularized
Euclidean Projection

Input: $\lambda_1 > 0$, $\lambda_2 > 0$, $L_0 > 0$, $X_0$, $k$
Output: $X_{k+1}$

1: Initialize $X_1 = X_0$, $\alpha_{-1} = 0$, $\alpha_0 = 1$, and $L = L_0$.
2: for $i = 1$ to $k$ do
3:   Set $\beta_i = \frac{\alpha_{i-2} - 1}{\alpha_{i-1}}$, $S_i = X_i + \beta_i (X_i - X_{i-1})$
4:   Find the smallest $L = L_{i-1}, 2L_{i-1}, \ldots$ such that
       $f(X_{i+1}) \le M_{L,S_i}(X_{i+1})$,
     where $X_{i+1} = \arg\min_Y M_{L,S_i}(Y)$
5:   Set $L_i = L$ and $\alpha_i = \frac{1 + \sqrt{1 + 4\alpha_{i-1}^2}}{2}$
6: end for

The efficient computation of (11) for any $q \ge 1$ is the main technical contribution of
this paper. Note that the $s$ groups in (11) are independent. Thus the optimization in (11)
decouples into a set of $s$ independent $\ell_q$-regularized Euclidean projection problems:

$$\pi_q(\mathbf{v}) = \arg\min_{\mathbf{x} \in \mathbb{R}^n} \left( g(\mathbf{x}) = \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|_2^2 + \lambda\|\mathbf{x}\|_q \right), \qquad (12)$$

where $n = p_i$ for the $i$-th group. Next, we study the key properties of (12).
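
The decoupling of (11) into independent per-group problems (12) can be sketched as follows. The helper names are hypothetical, groups are assumed to be contiguous blocks of a flat list, and the single-group solver shown is the closed form for $q = 2$ (see Remark 1), used here only as a stand-in for a general-$q$ solver.

```python
import math


def prox_l2(v, lam):
    """Closed-form solution of (12) for q = 2: shrink the whole group toward zero."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = (norm - lam) / norm
    return [scale * vi for vi in v]


def project_groups(v, group_sizes, lam, prox=prox_l2):
    """Solve (11) by applying the single-group projection (12) to each block of v."""
    out, start = [], 0
    for size in group_sizes:
        out.extend(prox(v[start:start + size], lam))
        start += size
    return out


# Two groups of size 2: the first is shrunk, the second is set to zero as a whole.
x = project_groups([3.0, 4.0, 0.1, -0.2], group_sizes=[2, 2], lam=1.0)
```

The second group vanishes entirely because its norm is below $\lambda$, which is the group sparsity effect that motivates the $\ell_1/\ell_q$ penalty.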

2.1 Properties of the Optimal Solution to (12)

The function $g(\cdot)$ is strictly convex, and thus it has a unique minimizer, as
summarized below:

Lemma 1 The problem (12) has a unique minimizer.

Next, we show that the optimal solution to (12) is given by zero under a certain
condition, as summarized in the following theorem:

Theorem 1 $\pi_q(\mathbf{v}) = \mathbf{0}$ if and only if $\lambda \ge \|\mathbf{v}\|_{\bar{q}}$.

Proof: Let us first compute the directional derivative of $g(\mathbf{x})$ at the point
$\mathbf{0}$:

$$Dg(\mathbf{0})[\mathbf{u}] = \lim_{\alpha \downarrow 0} \frac{1}{\alpha}[g(\alpha\mathbf{u}) - g(\mathbf{0})] = -\langle \mathbf{v}, \mathbf{u} \rangle + \lambda\|\mathbf{u}\|_q,$$

where $\mathbf{u}$ is a given direction. According to Hölder's inequality, we have

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\|_q \|\mathbf{v}\|_{\bar{q}}, \quad \forall \mathbf{u}.$$


Figure 1: Illustration of the failure of the fixed point iteration
$\mathbf{x} = \mathbf{v} - \lambda\|\mathbf{x}\|_q^{1-q}\mathbf{x}^{(q-1)}$ for solving (12).
We set $\mathbf{v} = [1, 3]^T$ and the starting point $\mathbf{x} = [1, 3]^T$. The vertical
axis denotes the values of $x_1$ during the iterations, and the horizontal axis denotes the
iteration.



Therefore, we have

$$Dg(\mathbf{0})[\mathbf{u}] \ge 0, \quad \forall \mathbf{u}, \qquad (13)$$

if and only if $\lambda \ge \|\mathbf{v}\|_{\bar{q}}$. The result follows, since (13) is the
necessary and sufficient condition for $\mathbf{0}$ to be the optimal solution of (12). □
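
Theorem 1 gives a cheap test for a zero solution before any root finding is attempted. A minimal sketch of this test is given below; the function names are hypothetical, and the conventions $\bar{q} = \infty$ for $q = 1$ and $\bar{q} = 1$ for $q = \infty$ are assumed.

```python
import math


def lq_norm(v, q):
    """Vector l_q norm, with q = inf giving the maximum absolute entry."""
    if math.isinf(q):
        return max(abs(vi) for vi in v)
    return sum(abs(vi) ** q for vi in v) ** (1.0 / q)


def solution_is_zero(v, lam, q):
    """Theorem 1: pi_q(v) = 0 if and only if lam >= ||v||_{q_bar}."""
    if q == 1:
        q_bar = math.inf
    elif math.isinf(q):
        q_bar = 1.0
    else:
        q_bar = q / (q - 1.0)
    return lam >= lq_norm(v, q_bar)


zero_for_q2 = solution_is_zero([3.0, 4.0], lam=5.0, q=2.0)   # ||v||_2 = 5
```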
Next, we focus on solving (12) for $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$. We first
consider solving (12) in the case of $1 < q < \infty$, which is the main technical
contribution of this paper. We begin with a lemma that summarizes the key properties of the
optimal solution to the problem (12):

Lemma 2 Let $1 < q < \infty$ and $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$. Then,
$\mathbf{x}^*$ is the optimal solution to the problem (12) if and only if it satisfies:

$$\mathbf{x}^* + \lambda\|\mathbf{x}^*\|_q^{1-q}\,\mathbf{x}^{*(q-1)} = \mathbf{v}, \qquad (14)$$

where $\mathbf{y} = \mathbf{x}^{(q-1)}$ is defined componentwise as
$y_i = \mathrm{sgn}(x_i)|x_i|^{q-1}$. Moreover, we have

$$\pi_q(\mathbf{v}) = \mathrm{sgn}(\mathbf{v}) \odot \pi_q(|\mathbf{v}|), \qquad (15)$$

$$\mathrm{sgn}(\mathbf{x}^*) = \mathrm{sgn}(\mathbf{v}), \qquad (16)$$

$$0 < |x_i^*| < |v_i|, \quad \forall i \in \{i \mid v_i \ne 0\}. \qquad (17)$$

Proof: Since $\lambda < \|\mathbf{v}\|_{\bar{q}}$, it follows from Theorem 1 that the optimal
solution $\mathbf{x}^* \ne \mathbf{0}$. $\|\mathbf{x}\|_q$ is differentiable when
$\mathbf{x} \ne \mathbf{0}$, and so is $g(\mathbf{x})$. Therefore, the sufficient and
necessary condition for $\mathbf{x}^*$ to be the solution of (12) is
$g'(\mathbf{x}^*) = \mathbf{0}$, i.e., (14). Denote
$c^* = \lambda\|\mathbf{x}^*\|_q^{1-q} > 0$. It follows from (14) that (15) holds, and

$$\mathrm{sgn}(x_i^*)\left(|x_i^*| + c^*|x_i^*|^{q-1}\right) = v_i, \qquad (18)$$

from which we can verify (16) and (17). □
It follows from Lemma 2 that i) if $v_i = 0$ then $x_i^* = 0$; and ii) $\pi_q(\mathbf{v})$
can be easily obtained from $\pi_q(|\mathbf{v}|)$. Thus, we can restrict our following
discussion to $\mathbf{v} > \mathbf{0}$, i.e., $v_i > 0$, $\forall i$. It is clear that the
analysis can be easily extended to the general $\mathbf{v}$. The optimality condition in
(14) indicates that $\mathbf{x}^*$ might be solved via the fixed point iteration

$$\mathbf{x} = \eta(\mathbf{x}) = \mathbf{v} - \lambda\|\mathbf{x}\|_q^{1-q}\,\mathbf{x}^{(q-1)},$$

which is, however, not guaranteed to converge (see Figure 1 for examples), as
$\eta(\cdot)$ is not necessarily a contraction mapping [14, Proposition 3]. In addition,
$\mathbf{x}^*$ cannot be trivially solved by first guessing $c = \|\mathbf{x}\|_q^{1-q}$ and
then finding the root of $\mathbf{x} + \lambda c\,\mathbf{x}^{(q-1)} = \mathbf{v}$, as when
$c$ increases, the values of $\mathbf{x}$ obtained from
$\mathbf{x} + \lambda c\,\mathbf{x}^{(q-1)} = \mathbf{v}$ decrease, so that
$c = \|\mathbf{x}\|_q^{1-q}$ increases as well (note that $1 - q < 0$).
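
The mapping $\eta(\cdot)$ is easy to write down, which makes it convenient for checking a candidate solution against the optimality condition (14) even though iterating it is not guaranteed to converge. The sketch below assumes $\mathbf{v} > \mathbf{0}$ and $\mathbf{x} > \mathbf{0}$; the function name is hypothetical, and the check uses the closed-form solution for $q = 2$ (see Remark 1).

```python
def eta(x, v, lam, q):
    """One application of eta(x) = v - lam * ||x||_q^(1-q) * x^(q-1), for x > 0."""
    norm_q = sum(xi ** q for xi in x) ** (1.0 / q)
    c = lam * norm_q ** (1.0 - q)
    return [vi - c * xi ** (q - 1.0) for vi, xi in zip(v, x)]


# For q = 2, v = [3, 4], and lam = 1, the solution (1 - lam / ||v||_2) * v = [2.4, 3.2]
# is a fixed point of eta, i.e., it satisfies the optimality condition (14).
v, lam = [3.0, 4.0], 1.0
x_star = [2.4, 3.2]
residual = max(abs(a - b) for a, b in zip(eta(x_star, v, lam, 2.0), x_star))
```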

2.2 Computing the Optimal Solution $\mathbf{x}^*$ by Zero Finding

In the following, we show that $\mathbf{x}^*$ can be obtained by solving two zero finding
problems. Below, we construct our first auxiliary function $h_c^v(\cdot)$ and reveal its
properties:

Definition 1 (Auxiliary Function $h_c^v(\cdot)$) Let $c > 0$, $1 < q < \infty$, and
$v > 0$. We define the auxiliary function $h_c^v(\cdot)$ as follows:

$$h_c^v(x) = x + c x^{q-1} - v, \quad 0 \le x \le v. \qquad (19)$$

Lemma 3 Let $c > 0$, $1 < q < \infty$, and $v > 0$. Then, $h_c^v(\cdot)$ has a unique root
in the interval $(0, v)$.

Proof: It is clear that $h_c^v(\cdot)$ is continuous and strictly increasing in the
interval $[0, v]$, $h_c^v(0) = -v < 0$, and $h_c^v(v) = c v^{q-1} > 0$. According to the
Intermediate Value Theorem, $h_c^v(\cdot)$ has a unique root lying in the interval
$(0, v)$. This concludes the proof. □
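
Because $h_c^v(\cdot)$ is strictly increasing with a sign change on $[0, v]$, its root can be bracketed and located by plain bisection. The following is a minimal sketch; the function name and the fixed iteration count are assumptions.

```python
def root_of_h(c, v, q, num_iter=100):
    """Bisection for the unique root of h(x) = x + c * x**(q-1) - v on (0, v)."""
    lo, hi = 0.0, v                  # h(lo) = -v < 0 and h(hi) = c * v**(q-1) > 0
    for _ in range(num_iter):
        mid = 0.5 * (lo + hi)
        if mid + c * mid ** (q - 1.0) - v > 0.0:
            hi = mid                 # root lies to the left of mid
        else:
            lo = mid                 # root lies at or to the right of mid
    return 0.5 * (lo + hi)


# With c = 1, v = 2, q = 3 the equation is x + x^2 = 2, whose root in (0, 2) is x = 1.
x_root = root_of_h(c=1.0, v=2.0, q=3.0)
```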

Corollary 1 Let $\mathbf{x}, \mathbf{v} \in \mathbb{R}^n$, $c > 0$, $1 < q < \infty$, and
$\mathbf{v} > \mathbf{0}$. Then, the function

$$\varphi_c^{\mathbf{v}}(\mathbf{x}) = \mathbf{x} + c\,\mathbf{x}^{(q-1)} - \mathbf{v}, \quad \mathbf{0} < \mathbf{x} < \mathbf{v} \qquad (20)$$

has a unique root.

Let $\mathbf{x}^*$ be the optimal solution satisfying (14). Denote
$c^* = \lambda\|\mathbf{x}^*\|_q^{1-q}$. It follows from Lemma 2 and Corollary 1 that
$\mathbf{x}^*$ is the unique root of $\varphi_{c^*}^{\mathbf{v}}(\cdot)$ defined in (20),
provided that the optimal $c^*$ is known. Our methodology for computing $\mathbf{x}^*$ is to
first compute the optimal $c^*$ and then compute $\mathbf{x}^*$ by computing the root of
$\varphi_{c^*}^{\mathbf{v}}(\cdot)$. Next, we show how to compute the optimal $c^*$ by
solving a single variable zero finding problem. We need our second auxiliary function
$\omega(\cdot)$ defined as follows:

Definition 2 (Auxiliary Function $\omega(\cdot)$) Let $1 < q < \infty$ and $v > 0$. We
define the auxiliary function $\omega(\cdot)$ as follows:

$$c = \omega(x) = (v - x)/x^{q-1}, \quad 0 < x \le v. \qquad (21)$$

Lemma 4 In the interval $(0, v]$, $c = \omega(x)$ is i) continuously differentiable, ii)
strictly decreasing, and iii) invertible. Moreover, in the domain $[0, \infty)$, the inverse
function $x = \omega^{-1}(c)$ is continuously differentiable and strictly decreasing.



Proof: It is easy to verify that, in the interval $(0, v]$, $c = \omega(x)$ is
continuously differentiable with a negative derivative, i.e., $\omega'(x) < 0$. Therefore,
the results follow from the Inverse Function Theorem. □
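
For completeness, the sign of the derivative used in the proof can be checked by a direct computation from (21), writing $\omega(x) = (v - x)x^{1-q}$:

```latex
\omega'(x) = -x^{1-q} + (1-q)(v-x)\,x^{-q}
           = -x^{-q}\left[\, x + (q-1)(v-x) \,\right] < 0,
\qquad 0 < x \le v,\ q > 1,
```

since $x > 0$ and $(q-1)(v-x) \ge 0$ on this interval.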
It follows from Lemma 4 that given the optimal $c^*$ and $\mathbf{v}$, the optimal
$\mathbf{x}^*$ can be computed via the inverse function $\omega^{-1}(\cdot)$, i.e., we can
represent $\mathbf{x}^*$ as a function of $c^*$. Since
$\lambda\|\mathbf{x}^*\|_q^{1-q} - c^* = 0$ by the definition of $c^*$, the optimal $c^*$ is
a root of our third auxiliary function $\phi(\cdot)$ defined as follows:

Definition 3 (Auxiliary Function $\phi(\cdot)$) Let $1 < q < \infty$,
$0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$, and $\mathbf{v} > \mathbf{0}$. We define the
auxiliary function $\phi(\cdot)$ as follows:

$$\phi(c) = \lambda\psi(c) - c, \quad c \ge 0, \qquad (22)$$

where

$$\psi(c) = \left( \sum_{i=1}^{n} \left(\omega_i^{-1}(c)\right)^q \right)^{\frac{1-q}{q}}, \qquad (23)$$

and $\omega_i^{-1}(c)$ is the inverse function of

$$\omega_i(x) = (v_i - x)/x^{q-1}, \quad 0 < x \le v_i. \qquad (24)$$

Recall that we assume $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$ (otherwise the optimal
solution is given by zero from Theorem 1). The following lemma summarizes the key
properties of the auxiliary function $\phi(\cdot)$:

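Lemma 5 Let $1 < q < \infty$, $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$,
$\mathbf{v} > \mathbf{0}$, and

$$\epsilon = (\|\mathbf{v}\|_{\bar{q}} - \lambda)/\|\mathbf{v}\|_{\bar{q}}. \qquad (25)$$

Then, $\phi(\cdot)$ is continuously differentiable in the interval $[0, \infty)$. Moreover,
we have

$$\phi(0) = \lambda\|\mathbf{v}\|_q^{1-q} > 0, \quad \phi(\bar{c}) \le 0,$$

where

$$\bar{c} = \max_i c_i, \qquad (26)$$

$$c_i = \omega_i(v_i \epsilon), \quad i = 1, 2, \ldots, n. \qquad (27)$$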

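Proof: From Lemma 4, the function $\omega_i^{-1}(c)$ is continuously differentiable in
$[0, \infty)$. It is easy to verify that $\omega_i^{-1}(c) > 0$, $\forall c \in [0, \infty)$.
Thus, $\phi(\cdot)$ in (22) is continuously differentiable in $[0, \infty)$.

It is clear that $\phi(0) = \lambda\|\mathbf{v}\|_q^{1-q} > 0$. Next, we show
$\phi(\bar{c}) \le 0$. Since $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$, we have

$$0 < \epsilon < 1. \qquad (28)$$

It follows from (24), (26), (27) and (28) that $0 < c_i \le \bar{c}$, $\forall i$. Let
$\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ be the root of
$\varphi_{\bar{c}}^{\mathbf{v}}(\cdot)$ (see Corollary 1). Then,
$x_i = \omega_i^{-1}(\bar{c})$. Since $\omega_i^{-1}(\cdot)$ is strictly decreasing (see
Lemma 4), $c_i \le \bar{c}$, $v_i\epsilon = \omega_i^{-1}(c_i)$, and
$x_i = \omega_i^{-1}(\bar{c})$, we have

$$x_i \le v_i \epsilon. \qquad (29)$$

Combining (24), (29), and $\bar{c} = \omega_i(x_i)$, we have
$\bar{c} \ge v_i(1 - \epsilon)/x_i^{q-1}$. It follows that
$x_i \ge \left( \frac{v_i(1-\epsilon)}{\bar{c}} \right)^{\frac{1}{q-1}}$. Thus, the following
holds:

$$\psi(\bar{c}) = \left( \sum_{i=1}^{n} \left(\omega_i^{-1}(\bar{c})\right)^q \right)^{\frac{1-q}{q}} \le \frac{\bar{c}}{\|\mathbf{v}\|_{\bar{q}}(1-\epsilon)},$$

which leads to

$$\phi(\bar{c}) = \lambda\psi(\bar{c}) - \bar{c} \le \bar{c}\left[ \frac{\lambda}{\|\mathbf{v}\|_{\bar{q}}(1-\epsilon)} - 1 \right] = 0,$$

where the last equality follows from (25). □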

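Corollary 2 Let $1 < q < \infty$, $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$,
$\mathbf{v} > \mathbf{0}$, and $\underline{c} = \min_i c_i$, where the $c_i$'s are defined in
(27). We have $0 < \underline{c} \le \bar{c}$ and $\phi(\underline{c}) \ge 0$.

Following Lemma 5 and Corollary 2, we can find at least one root of $\phi(\cdot)$ in the
interval $[\underline{c}, \bar{c}]$. In the following theorem, we show that $\phi(\cdot)$
has a unique root:

Theorem 2 Let $1 < q < \infty$, $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$, and
$\mathbf{v} > \mathbf{0}$. Then, in $[\underline{c}, \bar{c}]$, $\phi(\cdot)$ has a unique
root, denoted by $c^*$, and the root of $\varphi_{c^*}^{\mathbf{v}}(\cdot)$ is the optimal
solution to (12).

Proof: From Lemma 5 and Corollary 2, we have $\phi(\bar{c}) \le 0$ and
$\phi(\underline{c}) \ge 0$. If either $\phi(\bar{c}) = 0$ or $\phi(\underline{c}) = 0$,
$\bar{c}$ or $\underline{c}$ is a root of $\phi(\cdot)$. Otherwise, we have
$\phi(\underline{c})\phi(\bar{c}) < 0$. As $\phi(\cdot)$ is continuous in $[0, \infty)$, we
conclude that $\phi(\cdot)$ has a root in $(\underline{c}, \bar{c})$ according to the
Intermediate Value Theorem.

Next, we show that $\phi(\cdot)$ has a unique root in the interval $[0, \infty)$. We prove
this by contradiction. Assume that $\phi(\cdot)$ has two roots: $0 < c_1 < c_2$. From
Corollary 1, $\varphi_{c_1}^{\mathbf{v}}(\cdot)$ and $\varphi_{c_2}^{\mathbf{v}}(\cdot)$ have
unique roots. Denote $\mathbf{x}^1 = [x_1^1, x_2^1, \ldots, x_n^1]^T$ and
$\mathbf{x}^2 = [x_1^2, x_2^2, \ldots, x_n^2]^T$ as the roots of
$\varphi_{c_1}^{\mathbf{v}}(\cdot)$ and $\varphi_{c_2}^{\mathbf{v}}(\cdot)$, respectively. We
have $0 < x_i^1, x_i^2 < v_i$, $\forall i$. It follows from (22)-(24) that

$$\mathbf{x}^1 + \lambda\|\mathbf{x}^1\|_q^{1-q}\,\mathbf{x}^{1(q-1)} - \mathbf{v} = \mathbf{0},$$

$$\mathbf{x}^2 + \lambda\|\mathbf{x}^2\|_q^{1-q}\,\mathbf{x}^{2(q-1)} - \mathbf{v} = \mathbf{0}.$$

According to Lemma 2, $\mathbf{x}^1$ and $\mathbf{x}^2$ are optimal solutions of (12). From
Lemma 1, we have $\mathbf{x}^1 = \mathbf{x}^2$. However, since
$x_i^1 = \omega_i^{-1}(c_1)$, $x_i^2 = \omega_i^{-1}(c_2)$, $\omega_i^{-1}(\cdot)$ is a
strictly decreasing function in $[0, \infty)$ by Lemma 4, and $c_1 < c_2$, we have
$x_i^1 > x_i^2$, $\forall i$. This leads to a contradiction. Therefore, we conclude that
$\phi(\cdot)$ has a unique root in $[\underline{c}, \bar{c}]$.

From the above arguments, it is clear that the root of
$\varphi_{c^*}^{\mathbf{v}}(\cdot)$ is the optimal solution to (12). □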

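Remark 1 When $q = 2$, we have
$\underline{c} = \bar{c} = \frac{\lambda}{\|\mathbf{v}\|_2 - \lambda}$. It is easy to verify
that $\phi(\underline{c}) = \phi(\bar{c}) = 0$ and

$$\pi_2(\mathbf{v}) = \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\,\mathbf{v}. \qquad (30)$$

Therefore, when $q = 2$, we obtain a closed-form solution.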
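
As a quick check of the remark, the closed form can be substituted into the optimality condition (14) with $q = 2$. Writing $\mathbf{x} = \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\mathbf{v}$, so that $\|\mathbf{x}\|_2 = \|\mathbf{v}\|_2 - \lambda$:

```latex
\mathbf{x} + \lambda \|\mathbf{x}\|_2^{-1}\,\mathbf{x}
  = \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\,\mathbf{v}
    + \frac{\lambda}{\|\mathbf{v}\|_2 - \lambda}\cdot
      \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\,\mathbf{v}
  = \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\,\mathbf{v}
    + \frac{\lambda}{\|\mathbf{v}\|_2}\,\mathbf{v}
  = \mathbf{v}.
```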
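2.3 Solving the Zero Finding Problem by Bisection

Let $1 < q < \infty$, $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$, $\mathbf{v} > \mathbf{0}$,
$\bar{v} = \max_i v_i$, $\underline{v} = \min_i v_i$, and $\delta > 0$ be a small constant
(e.g., $\delta = 10^{-8}$ in our experiments). When $q > 2$, it follows from (26), (27), and
Corollary 2 that

$$\underline{c} = \frac{1 - \epsilon}{\epsilon^{q-1}\,\bar{v}^{\,q-2}}, \quad \bar{c} = \frac{1 - \epsilon}{\epsilon^{q-1}\,\underline{v}^{\,q-2}}.$$

When $1 < q < 2$, we have

$$\underline{c} = \frac{1 - \epsilon}{\epsilon^{q-1}}\,\underline{v}^{\,2-q}, \quad \bar{c} = \frac{1 - \epsilon}{\epsilon^{q-1}}\,\bar{v}^{\,2-q}.$$

If either $\phi(\underline{c}) = 0$ or $\phi(\bar{c}) = 0$, $\underline{c}$ or $\bar{c}$ is
the unique root of $\phi(\cdot)$. Otherwise, we can find the unique root of $\phi(\cdot)$ by
bisection in the interval $(\underline{c}, \bar{c})$, which costs at most

$$N = \left\lceil \log_2 \frac{\bar{c} - \underline{c}}{\delta} \right\rceil$$

iterations for achieving an accuracy of $\delta$. Let $[c_1, c_2]$ be the current interval of
uncertainty, and suppose we have computed $\omega_i^{-1}(c_1)$ and $\omega_i^{-1}(c_2)$ in the
previous bisection iterations. Setting $c = \frac{c_1 + c_2}{2}$, we need to evaluate
$\phi(c)$ by computing $\omega_i^{-1}(c)$, $i = 1, 2, \ldots, n$. It is easy to verify that
$\omega_i^{-1}(c)$ is the root of $h_c^{v_i}(\cdot)$ in the interval $(0, v_i)$. Since
$\omega_i^{-1}(\cdot)$ is a strictly decreasing function (see Lemma 4), the following holds:

$$\omega_i^{-1}(c_2) < \omega_i^{-1}(c) < \omega_i^{-1}(c_1),$$

and thus $\omega_i^{-1}(c)$ can be solved by bisection using at most

$$\log_2 \frac{\omega_i^{-1}(c_1) - \omega_i^{-1}(c_2)}{\delta} \le \log_2 \frac{v_i}{\delta} \le \log_2 \frac{\bar{v}}{\delta}$$

iterations for achieving an accuracy of $\delta$. For given $\mathbf{v}$, $\lambda$, and
$\delta$, $N$ and $\bar{v}$ are constant, and thus it costs $O(n)$ for finding the root of
$\phi(\cdot)$. Once $c^*$, the root of $\phi(\cdot)$, is found, it costs $O(n)$ flops to
compute $\mathbf{x}^*$ as the unique root of $\varphi_{c^*}^{\mathbf{v}}(\cdot)$. Therefore,
the overall time complexity for solving (12) is $O(n)$.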
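
The two nested zero finding problems can be put together in a short routine. The following is a minimal sketch for $1 < q < \infty$, not the SLEP implementation: the function names are hypothetical, both bisections run for a fixed number of iterations rather than to a tolerance $\delta$, and the inner bisection always starts from $[0, v_i]$ instead of reusing the interval from the previous outer iteration.

```python
import math


def _omega_inv(c, v, q, num_iter=100):
    """x = omega^{-1}(c): root of x + c * x**(q-1) - v on [0, v], by bisection."""
    lo, hi = 0.0, v
    for _ in range(num_iter):
        mid = 0.5 * (lo + hi)
        if mid + c * mid ** (q - 1.0) > v:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def ep_q(v, lam, q, num_iter=100):
    """Sketch of the l_q-regularized Euclidean projection (12) for 1 < q < inf."""
    q_bar = q / (q - 1.0)
    a = [abs(vi) for vi in v]                         # reduce to |v| by (15)
    dual = sum(ai ** q_bar for ai in a) ** (1.0 / q_bar)
    if lam >= dual:                                   # Theorem 1: zero solution
        return [0.0] * len(v)
    eps = (dual - lam) / dual                         # (25)
    # c_i = omega_i(v_i * eps) as in (27); zero entries of v stay zero (Lemma 2).
    cs = [(1.0 - eps) * ai / (ai * eps) ** (q - 1.0) for ai in a if ai > 0.0]
    lo, hi = min(cs), max(cs)                         # bracket for the root of phi

    def solve(c):
        return [_omega_inv(c, ai, q) if ai > 0.0 else 0.0 for ai in a]

    def phi(c):                                       # (22)-(23)
        x = solve(c)
        return lam * sum(xi ** q for xi in x) ** ((1.0 - q) / q) - c

    for _ in range(num_iter):                         # phi(lo) >= 0 >= phi(hi)
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    x = solve(0.5 * (lo + hi))
    return [math.copysign(xi, vi) for xi, vi in zip(x, v)]   # restore signs, (15)


x2 = ep_q([3.0, 4.0], lam=1.0, q=2.0)        # approximately [2.4, 3.2], matching (30)
x3 = ep_q([1.0, 2.0, -3.0], lam=0.5, q=3.0)
```

The result for a general $q$ can be validated without a reference solver by substituting it back into the optimality condition (14), which is how the second call above is intended to be checked.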

We have shown how to solve (12) for $1 < q < \infty$. For $q = 1$, the problem (12)
is reduced to the one used in the standard Lasso, and it has the following closed-form
solution [3]:

$$\pi_1(\mathbf{v}) = \mathrm{sgn}(\mathbf{v}) \odot \max(|\mathbf{v}| - \lambda, 0). \qquad (31)$$

For $q = \infty$, the solution to the problem (12) can be computed via (32), as summarized in
the following theorem:

Theorem 3 Let $q = \infty$, $\bar{q} = 1$, and $0 < \lambda < \|\mathbf{v}\|_{\bar{q}}$. Then
we have

$$\pi_\infty(\mathbf{v}) = \mathrm{sgn}(\mathbf{v}) \odot \min(|\mathbf{v}|, t^*), \qquad (32)$$

where $t^*$ is the unique root of

$$h(t) = \sum_{i=1}^{n} \max(|v_i| - t, 0) - \lambda. \qquad (33)$$
Proof: Making use of the property that
$\|\mathbf{x}\|_\infty = \max_{\|\mathbf{y}\|_1 \le 1} \langle \mathbf{y}, \mathbf{x} \rangle$,
we can rewrite (12) in the case of $q = \infty$ as

$$\min_{\mathbf{x}} \max_{\mathbf{y}: \|\mathbf{y}\|_1 \le \lambda} s(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|_2^2 + \langle \mathbf{y}, \mathbf{x} \rangle. \qquad (34)$$

The function $s(\mathbf{x}, \mathbf{y})$ is continuously differentiable in both
$\mathbf{x}$ and $\mathbf{y}$, convex in $\mathbf{x}$ and concave in $\mathbf{y}$, and the
feasible domains are solids. According to the well-known von Neumann Lemma [25], the
min-max problem (34) has a saddle point, and thus the minimization and maximization can be
exchanged. Setting the derivative of $s(\mathbf{x}, \mathbf{y})$ with respect to
$\mathbf{x}$ to zero, we have

$$\mathbf{x} = \mathbf{v} - \mathbf{y}. \qquad (35)$$

Thus we obtain the following problem:

$$\min_{\mathbf{y}: \|\mathbf{y}\|_1 \le \lambda} \frac{1}{2}\|\mathbf{y} - \mathbf{v}\|_2^2, \qquad (36)$$

which is the problem of the Euclidean projection onto the $\ell_1$ ball [4, 6, 20]. It has
been shown that the optimal solution $\mathbf{y}^*$ to (36) for
$\lambda < \|\mathbf{v}\|_1$ can be obtained by first computing $t^*$ as the unique root of
(33) in linear time, and then computing $\mathbf{y}^*$ as

$$\mathbf{y}^* = \mathrm{sgn}(\mathbf{v}) \odot \max(|\mathbf{v}| - t^*, 0). \qquad (37)$$

It follows from (35) and (37) that (32) holds. □
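
Theorem 3 reduces the case $q = \infty$ to finding the root of the piecewise linear function $h(t)$ in (33). The sketch below locates that root by scanning the sorted magnitudes, which is a simple $O(n \log n)$ substitute chosen for brevity and is not the linear-time improved bisection method cited in the text; the function name is hypothetical.

```python
import math


def ep_inf(v, lam):
    """Sketch of (32)-(33): pi_inf(v) = sgn(v) * min(|v|, t*), with h(t*) = 0."""
    a = sorted((abs(vi) for vi in v), reverse=True)
    if lam >= sum(a):                    # Theorem 1 with q_bar = 1: zero solution
        return [0.0] * len(v)
    # h(t) = sum_i max(|v_i| - t, 0) - lam is piecewise linear and decreasing in t.
    # If exactly the k largest magnitudes exceed t, then h(t) = 0 gives
    # t = (sum of the k largest - lam) / k; accept the first consistent k.
    running, t_star = 0.0, 0.0
    for k, ak in enumerate(a, start=1):
        running += ak
        t = (running - lam) / k
        next_largest = a[k] if k < len(a) else 0.0
        if t >= next_largest:
            t_star = t
            break
    return [math.copysign(min(abs(vi), t_star), vi) for vi in v]


x_inf = ep_inf([3.0, -1.0, 2.0], lam=1.0)   # clips the largest magnitude down to t* = 2
```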

We conclude this section by summarizing the main steps for solving the
$\ell_q$-regularized Euclidean projection in Algorithm 2.

Algorithm 2 EP$_q$: $\ell_q$-regularized Euclidean projection
Input: $\lambda > 0$, $q \ge 1$, $\mathbf{v} \in \mathbb{R}^n$
Output: $\mathbf{x}^* = \pi_q(\mathbf{v}) = \arg\min_{\mathbf{x} \in \mathbb{R}^n} \frac{1}{2}\|\mathbf{x} - \mathbf{v}\|_2^2 + \lambda\|\mathbf{x}\|_q$

 1: Compute $\bar{q} = \frac{q}{q-1}$
 2: if $\|\mathbf{v}\|_{\bar{q}} \le \lambda$ then
 3:   Set $\mathbf{x}^* = \mathbf{0}$, return
 4: end if
 5: if $q = 1$ then
 6:   Set $\mathbf{x}^* = \mathrm{sgn}(\mathbf{v}) \odot \max(|\mathbf{v}| - \lambda, 0)$
 7: else if $q = 2$ then
 8:   Set $\mathbf{x}^* = \frac{\|\mathbf{v}\|_2 - \lambda}{\|\mathbf{v}\|_2}\mathbf{v}$
 9: else if $q = \infty$ then
10:   Obtain $t^*$, the unique root of $h(t)$, via the improved bisection method [20]
11:   Set $\mathbf{x}^* = \mathrm{sgn}(\mathbf{v}) \odot \min(|\mathbf{v}|, t^*)$
12: else
13:   Compute $c^*$, the unique root of $\phi(c)$, via bisection in the interval
      $[\underline{c}, \bar{c}]$ (Theorem 2)
14:   Obtain $\mathbf{x}^*$ as the unique root of $\varphi_{c^*}^{\mathbf{v}}(\cdot)$
15: end if

3 Experiments

We have conducted experiments to evaluate the efficiency of the proposed algorithm
using both synthetic and real-world data. We set the regularization parameter as
$\lambda = r \times \lambda^q_{\max}$, where $0 < r \le 1$ is the ratio, and
$\lambda^q_{\max}$ is the maximal value above which the $\ell_1/\ell_q$-norm regularized
problem (1) obtains a zero solution (see Theorem 1). We try the following values for $q$:
1.25, 1.5, 1.75, 2, 2.33, 3, 5, and $\infty$. The source codes, included in the SLEP
package [19], are available online$^2$.

$^2$ http://www.public.asu.edu/~jye02/Software/SLEP/

3.1 Simulation Studies

We use the synthetic data to study the effectiveness of the $\ell_1/\ell_q$-norm
regularization for reconstructing the jointly sparse matrix under different values of
$q > 1$. Let $A \in \mathbb{R}^{m \times d}$ be a measurement matrix with entries being
generated randomly from the standard normal distribution,
$X^* \in \mathbb{R}^{d \times k}$ be the jointly sparse matrix with the first
$\tilde{d} < d$ rows being nonzero and the remaining rows exactly zero, $Y = AX^* + Z$ be
the response matrix, and $Z \in \mathbb{R}^{m \times k}$ be the noise matrix whose entries
are drawn randomly from the normal distribution with mean zero and standard deviation
$\sigma = 0.1$. We treat each row of $X^*$ as a group, and estimate $X^*$ from $A$ and $Y$
by solving the following $\ell_1/\ell_q$-norm regularized problem:

$$X = \arg\min_W \frac{1}{2}\|AW - Y\|_F^2 + \lambda \sum_{i=1}^{d} \|W^i\|_q,$$

where $W^i$ denotes the $i$-th row of $W$. We set $m = 100$, $d = 200$, and
$\tilde{d} = k = 50$. We try two different settings for $X^*$, by drawing its nonzero
entries randomly from 1) the uniform distribution in the interval $[0, 1]$ and 2) the
standard normal distribution.
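
The synthetic setup can be reproduced along the following lines. This is a sketch of the data generation only, with hypothetical function names and an arbitrary random seed; matrices are stored as lists of rows, and the geometric sequence of regularization values mirrors the ratios $0.9^{i-1}$ used in the experiments.

```python
import random


def make_jointly_sparse_problem(m=100, d=200, k=50, d_nonzero=50, sigma=0.1,
                                uniform=True, seed=0):
    """Generate A (m x d), row-sparse X (d x k), and Y = A X + Z with Gaussian noise Z."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    draw = (lambda: rng.uniform(0.0, 1.0)) if uniform else (lambda: rng.gauss(0.0, 1.0))
    # Only the first d_nonzero rows of X are nonzero; each row is one group.
    X = [[draw() if i < d_nonzero else 0.0 for _ in range(k)] for i in range(d)]
    Y = [[sum(A[r][i] * X[i][c] for i in range(d_nonzero)) + rng.gauss(0.0, sigma)
          for c in range(k)] for r in range(m)]
    return A, X, Y


def lambda_path(lam_max, num=100, ratio=0.9):
    """Decreasing sequence lam_max * ratio**(i-1), i = 1..num, for warm-started solves."""
    return [lam_max * ratio ** i for i in range(num)]


A, X_true, Y = make_jointly_sparse_problem(m=5, d=8, k=3, d_nonzero=2)
path = lambda_path(2.0, num=3)
```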

We compute the solutions corresponding to a sequence of decreasing values of
$\lambda = r \times \lambda^q_{\max}$, where $r = 0.9^{i-1}$, for $i = 1, 2, \ldots, 100$. In
addition, we use the solution corresponding to $0.9^i \times \lambda^q_{\max}$ as the "warm"
start for $0.9^{i+1} \times \lambda^q_{\max}$. We report the results in Figure 2, from which
we can observe: 1) the distance between the solution $X$ and the truth $X^*$ usually
decreases with decreasing values of $\lambda$; 2) for the uniform distribution (see the
plots in the first row), $q = 1.5$ performs the best; 3) for the normal distribution (see
the plots in the second row), $q = 1.5$, 1.75, 2 and 3 achieve comparable performance and
perform better than $q = 1.25$, 5 and $\infty$; 4) with a properly chosen threshold, the
support of $X^*$ can be exactly recovered by the $\ell_1/\ell_q$-norm regularization with an
appropriate value of $q$, e.g., $q = 1.5$ for the uniform distribution, and $q = 2$ for the
normal distribution; and 5) the recovery of $X^*$ with nonzero entries drawn from the
normal distribution is easier than that with entries generated from the uniform
distribution.

The existing theoretical results [17, 23] cannot tell which $q$ is the best; and we
believe that the optimal $q$ depends on the distribution of $X^*$, as indicated by the above
results. Therefore, it is necessary to conduct distribution-specific theoretical studies
(note that the previous studies usually make no assumption on $X^*$). The proposed
GLEP$_{1q}$ algorithm should help verify the theoretical results to be established.

Figure 2: Performance of the $\ell_1/\ell_q$-norm regularization for reconstructing the
jointly sparse $X^*$. The nonzero entries of $X^*$ are drawn randomly from the uniform
distribution for the plots in the first row, and from the normal distribution for the plots
in the second row. Plots in the first two rows show $\|X - X^*\|_F$, the Frobenius norm
difference between the solution and the truth; and plots in the third row show the
$\ell_2$-norm of each row of the solution $X$.



3.2 Performance on the Letter Data Set

We apply the proposed GLEP$_{1q}$ algorithm for multi-task learning on the Letter data
set [29], which consists of 45,679 samples from 8 default tasks of two-class
classification problems for the handwritten letters: c/e, g/y, m/n, a/g, i/j, a/o, f/t,
h/n. The writings were collected from over 180 different writers, with the letters being
represented by $8 \times 16$ binary pixel images. We use the least squares loss for
$l(\cdot)$.

Figure 3: Computational time (seconds) comparison between GLEP$_{1q}$ ($q = 2$) and Spg
under different values of $\lambda = r \times \lambda^q_{\max}$ and $m$.



3.2.1 Efficiency Comparison with Spg 

We compare GLEP$_{1q}$ with the Spg algorithm proposed in [4]. Spg is a specialized
solver for the $\ell_1/\ell_2$-ball constrained optimization problem, and has been shown to
outperform existing algorithms based on blockwise coordinate descent and projected
gradient. In Figure 3, we report the computational time under different values of $m$
(the number of samples) and $\lambda = r \times \lambda^q_{\max}$ ($q = 2$). It is clear from
the plots that GLEP$_{1q}$ is much more efficient than Spg, which may be attributed to:
1) GLEP$_{1q}$ has a better convergence rate than Spg; and 2) when $q = 2$, the EP$_{1q}$ in
GLEP$_{1q}$ can be computed analytically (see Remark 1), while this is not the case in Spg.

Figure 4: Computation time (seconds) of GLEP$_{1q}$ under different values of $m$, $q$ and
$r$ (panel titles: $m = 11420$, $r = 0.01$; $m = 45679$, $r = 0.01$).

3.2.2 Efficiency under Different Values of q 

We report the computational time (seconds) of GLEP$_{1q}$ under different values of $q$,
$\lambda = r \times \lambda^q_{\max}$ and $m$ (the number of samples) in Figure 4. We can
observe from this figure that the computational time of GLEP$_{1q}$ under different values
of $q$ (for fixed $r$ and $m$) is comparable. Together with the result on the comparison
with Spg for $q = 2$, this experiment shows the promise of GLEP$_{1q}$ for solving
large-scale problems for any $q \ge 1$.

3.2.3 Performance under Different Values of q 

We randomly divide the Letter data into three non-overlapping sets: training,
validation, and testing. We train the model using the training set, and tune the
regularization parameter $\lambda = r \times \lambda^q_{\max}$ on the validation set, where
$r$ is chosen from
$\{10^{-1}, 5 \times 10^{-2}, 2 \times 10^{-2}, 1 \times 10^{-2}, 5 \times 10^{-3}, 2 \times 10^{-3}, 1 \times 10^{-3}\}$.
On the testing set, we compute the balanced error rate [11]. We report the results averaged
over 10 runs in Figure 5. The title of each plot indicates the percentages of samples used
for training, validation, and testing. The results show that, on this data set, a smaller
value of $q$ achieves better performance.

Figure 5: The balanced error rate achieved by the $\ell_1/\ell_q$ regularization under
different values of $q$. The title of each plot indicates the percentages of samples used
for training, validation, and testing (panel titles: (5%, 5%, 90%); (10%, 10%, 80%)).
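
The text cites the balanced error rate without restating it. The sketch below assumes the common definition, namely the average of the error rates on the positive and the negative class, with labels encoded as $+1$ and $-1$; the function name and the label encoding are assumptions.

```python
def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates (assumed definition; labels are +1 / -1)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t != 1]
    err_pos = sum(p != 1 for p in pos) / len(pos)   # fraction of positives misclassified
    err_neg = sum(p == 1 for p in neg) / len(neg)   # fraction of negatives misclassified
    return 0.5 * (err_pos + err_neg)


# Three positives with one mistake, one negative that is misclassified.
ber = balanced_error_rate([1, 1, 1, -1], [1, 1, -1, 1])
```

Unlike the plain error rate, this measure weights both classes equally, which matters when a task has unbalanced class sizes.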

4 Conclusion 

In this paper, we propose the GLEP$_{1q}$ algorithm for solving the $\ell_1/\ell_q$-norm
regularized problem, for any $q \ge 1$. The main technical contribution of this paper is
the efficient algorithm for the $\ell_1/\ell_q$-norm regularized Euclidean projection
(EP$_{1q}$), which is a key building block of GLEP$_{1q}$. Specifically, we analyze the key
theoretical properties of the solution of EP$_{1q}$, based on which we develop an efficient
algorithm for EP$_{1q}$ by solving two zero finding problems. Our analysis also reveals why
EP$_{1q}$ for the general $q$ is significantly more challenging than the special cases such
as $q = 2$.

In this paper, we focus on the efficient solution of the $\ell_1/\ell_q$-regularized
problem. We plan to study the effectiveness of the $\ell_1/\ell_q$ regularization under
different values of $q$ for real-world applications in computer vision and bioinformatics.
We also plan to conduct the distribution-specific [8] theoretical studies for different
values of $q$.